TIMIT-TTS: A Text-to-Speech Dataset for Multimodal Synthetic Media Detection
Authors
Abstract
With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material have become increasingly simple. Current technology enables the creation of videos in which both the visual and audio contents are falsified. The forensics community has begun to address this threat by developing fake media detectors. However, the vast majority of existing forensic techniques analyze only one modality at a time. This is an important limitation when authenticating manipulated videos, because sophisticated forgeries may be difficult to detect without exploiting cross-modal inconsistencies (e.g., across the audio and video tracks). One reason for the lack of multimodal detectors is the similar lack of research datasets containing multimodal forgeries. Existing datasets typically contain a single falsified modality, such as deepfaked videos with authentic audio tracks, or synthetic speech with no associated video. Multimodal datasets are therefore needed that can be used to develop, train, and test these algorithms. In this paper, we propose a new audio-visual deepfake dataset pairing synthetic speech with video. We present a general pipeline for synthesizing deepfake speech content from a given video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic, synchronized speech tracks. We use this pipeline to generate and release TIMIT-TTS, a synthetic speech dataset produced with some of the most cutting-edge methods in the TTS field. It can be used as a standalone audio dataset, or combined with the DeepfakeTIMIT and VidTIMIT video datasets to perform multimodal research. Finally, we present numerous experiments that benchmark the proposed dataset in both monomodal (i.e., audio) and multimodal (i.e., audio and video) conditions. This highlights the need for multimodal forensic detectors and more related data.
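The abstract states that the speech synthesis pipeline combines TTS output with Dynamic Time Warping so that the synthetic audio stays consistent with the original video's timing. The Python sketch below only illustrates that general idea under stated assumptions: it assumes librosa and soundfile are available, the file names are hypothetical, and the final global time-stretch is a crude simplification of what a real pipeline would do with the local warping path. It is not the authors' actual implementation.

# Minimal sketch of a TTS-to-video alignment step via DTW.
# File names, feature choices, and the global stretch are illustrative assumptions.
import librosa
import soundfile as sf

SR = 16000  # assumed working sample rate

# 1) Load the original video's audio track and the TTS-synthesized speech (hypothetical paths).
ref, _ = librosa.load("original_video_audio.wav", sr=SR)
tts, _ = librosa.load("tts_synthesized.wav", sr=SR)

# 2) Extract MFCC features from both signals for alignment.
mfcc_ref = librosa.feature.mfcc(y=ref, sr=SR, n_mfcc=13)
mfcc_tts = librosa.feature.mfcc(y=tts, sr=SR, n_mfcc=13)

# 3) Dynamic Time Warping between the two feature sequences.
#    wp contains the optimal warping path as pairs of frame indices.
_, wp = librosa.sequence.dtw(X=mfcc_ref, Y=mfcc_tts, metric="euclidean")

# 4) Crude global alignment: stretch the synthetic speech so its duration
#    matches the reference. A real pipeline would warp locally along wp.
rate = len(tts) / len(ref)
tts_aligned = librosa.effects.time_stretch(tts, rate=rate)

sf.write("tts_aligned.wav", tts_aligned, SR)
print(f"Warping path: {wp.shape[0]} frame pairs, global stretch rate = {rate:.2f}")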
Similar resources
A Taiwanese (min-nan) text-to-speech (TTS) system based on automatically generated synthetic units
A Taiwanese (Min-nan) Text-to-Speech (TTS) system has been constructed in this paper based on automatically generated synthetic units by considering several specific phonetic and linguistic characteristics of Taiwanese. Some basic facts about Taiwanese useful in a TTS system are summarized, including the issues of tone sandhi, the written format and others. Three functional modules, namely a ...
USC-TIMIT: A database of multimodal speech production data
USC-TIMIT is a speech production database under ongoing development, which currently includes real-time magnetic resonance imaging data from five male and five female speakers of American English, and electromagnetic articulography data from five of these speakers. The two modalities were recorded in two independent sessions while the subjects produced the same 460 sentence corpus. In both case...
A Multimodal Dataset for Deception Detection
This paper presents the construction of a multimodal dataset for deception detection, including physiological, thermal, and visual responses of human subjects under three deceptive scenarios. We present the experimental protocol, as well as the data acquisition process. To evaluate the usefulness of the dataset for the task of deception detection, we present a statistical analysis of the physio...
CHULA TTS: A Modularized Text-To-Speech Framework
Spoken and written languages evolve constantly through their everyday usage. Combined with the practical expectation of automatically generating synthetic speech suitable for various domains of context, this requires Text-to-Speech (TTS) systems for living languages to provide extensible handlers for new language phenomena or to be customized to the nature of the domains i...
Journal
Journal title: IEEE Access
Year: 2023
ISSN: 2169-3536
DOI: https://doi.org/10.1109/access.2023.3276480